fix: fix a race condition in the tracker #19513

grobinson-grafana · 2025-10-15T23:36:39Z

What this PR does / why we need it:

This commit fixes a race condition in the push tracker where both the p.done and p.err channels could have a value at the same time. This could cause the distributor to return success when it should instead have returned an error.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
Title matches the required conventional commits format, see here
- Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

This commit fixes a race condition in the push tracker where both the p.done and p.err channels could have a value at the same time. This could cause the distributor to return success when it should instead have returned an error.

claude · 2025-11-13T22:22:13Z

Claude finished @grobinson-grafana's task —— View job

Code Review Complete ✅

I've thoroughly reviewed this PR that fixes a race condition in the push tracker. Here's my detailed analysis:

Review Checklist:

Read and understand the changed files
Analyze the race condition fix implementation
Review test coverage and quality
Check for potential bugs or security issues
Assess performance implications
Provide constructive feedback

Summary

This is an excellent fix that addresses a critical race condition in the distributor's push tracking mechanism. The change replaces an unsafe channel-based approach with a proper mutex-protected state machine.

Race Condition Analysis ⭐

The Problem: The original pushTracker had a fundamental race condition where both p.done and p.err channels could receive values simultaneously:

// OLD CODE - UNSAFE
func (p *pushTracker) doneWithResult(err error) {
    if err == nil {
        if p.streamsPending.Dec() == 0 {
            p.done <- struct{}{}  // ← Could send here
        }
    } else {
        if p.streamsFailed.Inc() == 1 {
            p.err <- err          // ← AND here simultaneously
        }
    }
}

The Wait() method used a select that could receive from either channel, potentially returning success when an error occurred.

The Fix: The new basicPushTracker uses proper synchronization:

Mutex protects all state changes
Atomic operations replaced by mutex-protected counters
Channels used only for signaling, not data transfer
Error stored in protected state rather than sent through channel

Code Quality Assessment

✅ Strengths

Thread Safety: Proper mutex usage eliminates the race condition
Clear API: Well-defined PushTracker interface with good documentation
Error Handling: First-error semantics are correctly implemented
Panic Behavior: Consistent with sync.WaitGroup patterns
Performance: Fast-path optimization with done bool field
Resource Management: Proper channel creation/closing

✅ Test Coverage

The test suite is comprehensive and well-designed:

Edge Cases: Empty tracker, context cancellation, error propagation
Race Conditions: Fuzz testing with concurrent goroutines
API Misuse: Proper panic testing for invalid usage patterns
Error Semantics: First-error behavior verification
Concurrency: Multi-waiter scenarios with randomized timing

The fuzz test is particularly valuable for catching race conditions.

Technical Implementation

Correct Synchronization Pattern

func (t *basicPushTracker) Wait(ctx context.Context) error {
    t.mtx.Lock()
    if t.firstErr != nil || t.n == 0 {
        res := t.firstErr
        t.mtx.Unlock()
        return res  // Fast path - no channel operations needed
    }
    t.mtx.Unlock()
    
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-t.doneCh:
        return t.firstErr  // No mutex needed - immutable after close
    case <-t.errCh:
        return t.firstErr  // No mutex needed - immutable after close
    }
}

This pattern is correct because:

State is checked under mutex first (fast path)
Channels are used only for blocking/signaling
Once channels are closed, firstErr becomes immutable
Context cancellation is properly handled

Memory Safety

The implementation correctly handles the fact that once either channel is closed, the firstErr field becomes immutable, eliminating the need for mutex protection in the select branches.

Performance Considerations

Positive impacts:

✅ Fast path avoids channel operations when result is immediately available
✅ Single mutex is more efficient than atomic operations on multiple fields
✅ No unnecessary channel buffer allocations

Neutral impact:

The mutex overhead is negligible compared to distributed push operations
Channel-based signaling is still used but safely

Security Assessment

✅ No security concerns identified
✅ Proper error propagation prevents information leakage
✅ No resource leaks in error paths
✅ Panic behavior is consistent and documented

Integration Analysis

The changes integrate well with the existing codebase:

Interface abstraction allows future tracker implementations
Existing call sites require minimal changes
Maintains backward compatibility in behavior
Proper error propagation to callers

Minor Suggestions

Documentation: Consider adding a comment explaining why both doneCh and errCh are needed (early error termination vs. completion)
Naming: The function comment references [newSimplePushTracker] but should reference [basicPushTracker]

Verdict: APPROVED ✅

This is a high-quality fix that:

✅ Correctly solves the race condition
✅ Maintains existing API semantics
✅ Includes comprehensive tests
✅ Follows Go best practices
✅ Has no performance regressions
✅ Improves system reliability

The implementation demonstrates solid understanding of Go concurrency patterns and the fix directly addresses the core issue described in the PR.

pull-request-size bot added the size/L label Oct 15, 2025

grobinson-grafana force-pushed the grobinson/fix-race-condition-in-tracker branch 8 times, most recently from 5b34fb2 to b3d4ad5 Compare October 16, 2025 12:00

grobinson-grafana marked this pull request as ready for review October 16, 2025 12:57

grobinson-grafana requested a review from a team as a code owner October 16, 2025 12:57

grobinson-grafana force-pushed the grobinson/fix-race-condition-in-tracker branch from b3d4ad5 to 2301021 Compare October 16, 2025 13:48

fix: fix a race condition in the tracker

5dc2cd5

This commit fixes a race condition in the push tracker where both the p.done and p.err channels could have a value at the same time. This could cause the distributor to return success when it should instead have returned an error.

grobinson-grafana force-pushed the grobinson/fix-race-condition-in-tracker branch from 2301021 to 5dc2cd5 Compare October 16, 2025 13:50

trevorwhitney approved these changes Oct 20, 2025

View reviewed changes

grobinson-grafana added the claude-review Requests a review from Claude Code when added label Nov 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: fix a race condition in the tracker #19513

fix: fix a race condition in the tracker #19513

grobinson-grafana commented Oct 15, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 13, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: fix a race condition in the tracker #19513

Are you sure you want to change the base?

fix: fix a race condition in the tracker #19513

Conversation

grobinson-grafana commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✅

Summary

Race Condition Analysis ⭐

Code Quality Assessment

✅ Strengths

✅ Test Coverage

Technical Implementation

Correct Synchronization Pattern

Memory Safety

Performance Considerations

Security Assessment

Integration Analysis

Minor Suggestions

Verdict: APPROVED ✅

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

grobinson-grafana commented Oct 15, 2025 •

edited

Loading

claude bot commented Nov 13, 2025 •

edited

Loading